While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term modeling, with only a marginal cost. Based on this idea, we build MeMViT, a Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models are available at https://github.com/facebookresearch/memvit.
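The online caching idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MeMViT implementation: `MemoryAttention`, `attend`, and `max_cached_clips` are hypothetical names, the query/key/value projections are identities, and tokens are plain Python lists rather than tensors.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-query scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


class MemoryAttention:
    """Hypothetical helper: attends over the current clip plus cached clips."""

    def __init__(self, max_cached_clips):
        # A bounded deque drops the oldest clip once the cache is full.
        self.cache = deque(maxlen=max_cached_clips)

    def step(self, clip_tokens):
        # Context = tokens cached from earlier clips + tokens of the current clip.
        context = [tok for clip in self.cache for tok in clip] + clip_tokens
        outputs = [attend(tok, context, context) for tok in clip_tokens]
        self.cache.append(clip_tokens)  # becomes "memory" for later iterations
        return outputs, len(context)


# Process four 2-token clips one at a time, caching at most two past clips.
layer = MemoryAttention(max_cached_clips=2)
clips = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(4)]
results = [layer.step(clip) for clip in clips]
context_sizes = [size for _, size in results]
```

Each step only computes attention for the current clip's tokens, so the per-step cost grows with the cache size rather than with the full video length; that is the intuition behind the "marginal cost" claim, although the actual savings reported in the abstract depend on details this sketch omits.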
Translated by Google Translate
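The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers rather than to the inherent inductive biases of convolutions. In this work, we reexamine the design space and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.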
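We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the features of the masked regions. We study five different types of features and find that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 38.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame, and obtains competitive results on ImageNet.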
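The mask-then-predict recipe in this abstract can be illustrated with a short sketch. It rests on several assumptions that are not taken from the abstract: the helper names `random_mask` and `masked_feature_loss` are hypothetical, the loss is a mean squared error, the 40% mask ratio is only illustrative, and the target features are arbitrary vectors standing in for HOG descriptors.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Return a list of booleans marking which token positions are masked."""
    num_masked = int(round(num_tokens * mask_ratio))
    masked = set(rng.sample(range(num_tokens), num_masked))
    return [i in masked for i in range(num_tokens)]


def masked_feature_loss(predicted, target, mask):
    """Mean squared error averaged over masked tokens only."""
    errors = [
        sum((p - t) ** 2 for p, t in zip(pred_vec, tgt_vec)) / len(tgt_vec)
        for pred_vec, tgt_vec, is_masked in zip(predicted, target, mask)
        if is_masked
    ]
    return sum(errors) / len(errors) if errors else 0.0


# Mask 40% of ten token positions (illustrative ratio, assumed here).
mask = random_mask(num_tokens=10, mask_ratio=0.4, rng=random.Random(0))

# Only the first token is masked, so the second token's error is ignored.
toy_loss = masked_feature_loss(
    predicted=[[1.0, 1.0], [5.0, 5.0]],
    target=[[0.0, 0.0], [0.0, 0.0]],
    mask=[True, False],
)
```

Restricting the loss to masked positions means the model is never rewarded for copying features it can already see.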
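In this paper, we study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it on ImageNet classification, COCO detection, and Kinetics video recognition, where it outperforms prior work. We further compare MViT's pooling attention to window attention mechanisms and find that it outperforms the latter in the accuracy/compute trade-off. Without bells and whistles, MViT achieves state-of-the-art performance in three domains: 88.8% accuracy on ImageNet classification, 56.1 box AP on COCO object detection, and 86.1% on Kinetics-400 video classification. Code and models will be made publicly available.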
To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank (supportive information extracted over the entire span of a video) to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades. Code is available online at https://github.com/facebookresearch/video-long-term-feature-banks.
Deep embeddings answer one simple question: How similar are two images? Learning these embeddings is the bedrock of verification, zero-shot learning, and visual search. The most prominent approaches optimize a deep convolutional network with a suitable loss function, such as contrastive loss or triplet loss. While a rich line of work focuses solely on the loss functions, we show in this paper that selecting training examples plays an equally important role. We propose distance weighted sampling, which selects more informative and stable examples than traditional approaches. In addition, we show that a simple margin-based loss is sufficient to outperform all other loss functions. We evaluate our approach on the Stanford Online Products, CAR196, and CUB200-2011 datasets for image retrieval and clustering, and on the LFW dataset for face verification. Our method achieves state-of-the-art performance on all of them.
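One way to picture distance weighted sampling is sketched below. The abstract gives no formulas, so this rests on assumptions: embeddings are taken to be L2-normalized, the density of pairwise distances between uniformly distributed points on a unit sphere in n dimensions is proportional to d^(n-2) (1 - d^2/4)^((n-3)/2), and negatives are drawn with weights inversely proportional to that density. The function names and the clipping constants `cutoff` and `max_distance` are illustrative choices, not values from the paper.

```python
import math
import random


def log_q(distance, dim):
    """Log-density (up to a constant) of the distance between two points
    drawn uniformly from the unit sphere in R^dim."""
    return ((dim - 2) * math.log(distance)
            + 0.5 * (dim - 3) * math.log(1.0 - 0.25 * distance ** 2))


def sampling_probabilities(distances, dim, cutoff=0.5, max_distance=1.9):
    """Probabilities proportional to 1 / q(d), with distances clipped so that
    very close pairs do not dominate and log(0) is never evaluated at d = 2."""
    clipped = [min(max(d, cutoff), max_distance) for d in distances]
    log_weights = [-log_q(d, dim) for d in clipped]
    top = max(log_weights)  # subtract the max before exponentiating
    weights = [math.exp(lw - top) for lw in log_weights]
    total = sum(weights)
    return [w / total for w in weights]


def sample_negative(distances, dim, rng=random):
    """Draw the index of one negative example for a given anchor."""
    probs = sampling_probabilities(distances, dim)
    return rng.choices(range(len(distances)), weights=probs, k=1)[0]


# Distances from one anchor to three candidate negatives, 128-d embeddings.
probs = sampling_probabilities([0.5, 1.0, 1.41], dim=128)
index = sample_negative([0.5, 1.0, 1.41], dim=128, rng=random.Random(0))
```

In high dimensions most random pairs sit near distance sqrt(2), so uniform sampling rarely sees closer negatives; weighting by the inverse density spreads samples across the distance range instead.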
A recent study has shown a phenomenon called neural collapse, in which the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
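The simplex equiangular tight frame mentioned above has a simple closed form, which the sketch below constructs and checks. It illustrates only the target geometry (K unit vectors with pairwise cosine -1/(K-1)), not the regularizer proposed in the paper; the function names are hypothetical, and the frame is built in R^K without the optional rotation that the general definition allows.

```python
import math


def simplex_etf(num_classes):
    """Rows are K unit vectors forming a simplex equiangular tight frame:
    sqrt(K / (K - 1)) * (I - (1 / K) * ones)."""
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# For K = 4 classes, every pair of distinct vertices has cosine -1/3.
etf = simplex_etf(4)
pairwise = [cosine(etf[i], etf[j]) for i in range(4) for j in range(i + 1, 4)]
```

The value -1/(K-1) is the most negative cosine that K vectors can share pairwise, which is why this structure is described as maximally separated.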
Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only image-level labels. Most existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet they ignore the co-occurrence confounder of the object and its context (e.g., fish and water), which makes it hard for the model to distinguish object boundaries. Besides, the use of CAM also brings a dilemma: classification and localization always suffer from a performance gap and cannot reach their highest accuracy simultaneously. In this paper, we propose a causal knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and addressing the dilemma between classification and localization performance.
Witnessing the impressive achievements of pre-training techniques on large-scale data in the fields of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem in visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains a large amount of information irrelevant to decision making, which makes predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework designed for policy pre-training in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns a driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios demonstrate the superiority of our proposed approach, with improvements ranging from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.